Rewrote textrank_sentences() #7
base: master
Conversation
…dded progress bar and parallelization
I think you should really test out using the minhash algorithm. That is more of a solution if you have large volumes of sentences: if you don't use it, all pairwise sentence similarities have to be calculated. Please try out the minhash algorithm.
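For reference, the LSH route being suggested could look roughly like the following, based on the textrank/textreuse documentation; a sketch only, where `sentences` and `terminology` (with columns `textrank_id` and `lemma`) are assumed to exist in the shape the package examples use:

```r
library(textrank)
library(textreuse)  # provides minhash_generator()

## Build a minhash function; the LSH step then only proposes candidate
## sentence pairs instead of scoring all pairwise combinations.
minhash <- minhash_generator(n = 1000, seed = 123456789)

## 'terminology' is assumed to be a data.frame with one row per token,
## holding the columns textrank_id (sentence identifier) and lemma.
candidates <- textrank_candidates_lsh(x = terminology$lemma,
                                      sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 500)

## Score only the LSH candidate pairs.
tr <- textrank_sentences(data = sentences, terminology = terminology,
                         textrank_candidates = candidates)
```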
On another note, some advice on reducing the dimensionality of the number of sentences: it is better to use text clustering first (e.g. using the topicmodels R package or the BTM package, https://cran.r-project.org/web/packages/BTM/index.html) and to apply textrank inside each cluster, as sketched below.
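A sketch of how that cluster-first approach could be wired up. All names here are illustrative: `tokens` is assumed to be a data.frame with columns `doc_id` (sentence identifier) and `lemma`, as in the BTM examples, and `sentences`/`terminology` as above:

```r
library(BTM)
library(textrank)

## Fit a biterm topic model on the sentence tokens.
set.seed(321)
model   <- BTM(tokens, k = 10)
scores  <- predict(model, newdata = tokens)   # topic probabilities per sentence
cluster <- apply(scores, 1, which.max)        # hard-assign each sentence to a topic

## Apply textrank inside each cluster, so similarities are only computed
## between sentences that share a topic.
summaries <- lapply(split(names(cluster), cluster), function(ids) {
  textrank_sentences(data        = sentences[sentences$textrank_id %in% ids, ],
                     terminology = terminology[terminology$textrank_id %in% ids, ])
})
```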
Thanks for the advice - I'm trying out different approaches and have had a look at the minhash algorithm, but running the textrank_candidates_lsh function itself takes longer than running the rewritten textrank_sentences. And that is only when it runs at all: if I run it on all my 12,000 sentences, it fails and throws an error. I also had a look at the BTM package, but again it takes a long time to complete. Really, the fastest way to do it is using textrank_sentences.
I've read through the changes a bit. Am I correct that the speed difference basically comes from calculating the overlap in batches by groups of textrank_ids and from parallelising the mapply loop?
Not really - I actually haven't even used the parallelisation. The reason is that I have used data.table and thus update by reference, which means lower memory use and faster speed.
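For context, a minimal illustration of what update by reference means in data.table (the table and column names are made up):

```r
library(data.table)

dt <- data.table(textrank_id = 1:5, weight = runif(5))

## dt$weight <- dt$weight * 2 would copy the whole column first;
## := updates the column in place, so large tables stay cheap in
## both memory and time.
dt[, weight := weight * 2]
```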
OK, but in that case, can you drop the usage of the pbapply package? In general I'm against adding package dependencies which are not needed; a dependency on another package seems like overkill to me. Why not add a simple trace argument and print out something every, say, 1000 comparisons? That removes another dependency which might give maintenance problems later on.
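Something along these lines is what's being suggested - illustrative only, not the actual package code. `pairs` is hypothetical: a data.frame whose list columns `termset1`/`termset2` hold the term sets of each candidate sentence pair:

```r
## A plain loop over candidate pairs with a trace argument, printing via
## cat() every 1000 comparisons instead of using a pbapply progress bar.
overlap_all <- function(pairs, trace = FALSE) {
  n   <- nrow(pairs)
  out <- numeric(n)
  for (i in seq_len(n)) {
    a <- pairs$termset1[[i]]
    b <- pairs$termset2[[i]]
    out[i] <- length(intersect(a, b)) / length(union(a, b))  # Jaccard overlap
    if (trace && i %% 1000 == 0) {
      cat(sprintf("%s: %d/%d comparisons done\n", format(Sys.time()), i, n))
    }
  }
  out
}
```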
That is a good principle - one I tend to stick to as well, but I guess I got carried away :) I'll have a look at it and write the pbapply package out.
great |
…textrank_sentences()
Removed the use of pbapply and replaced it with cat - it's not as pretty, but it gets the job done if you want to monitor the progress of the function.
Thanks, I'm going to review this soon and incorporate it.
I've reviewed your code and updated it according to what I thought was more readable. Can you try it out on your own dataset and let me know if this is fine?
I rewrote textrank_sentences() as it could not handle my dataset (it ran for 3 days without finishing). In doing so I added pbapply to show progress for the sentence_dist function as well as to enable parallelization.
A pretty solid upgrade to an already pretty solid function, if I should say so myself!